Task: Investigate Problem And Identify Root Cause

The focus during problem investigation is to diagnose the root cause of the problem and arrive at a workaround to restore the service as quickly as possible. The speed and nature of the investigation depend largely on the impact, severity, and urgency of the problem. Problem Management must ensure that the right level of expertise and resources is available to execute this task. The time taken to arrive at the root cause and the solution must be in line with the defined Service Level Agreements (SLAs).

The Configuration Management Database (CMDB) should be used to determine the level of impact and to diagnose the exact point of failure. The CMDB also makes it possible to investigate the context in which the problem occurred and thereby find a solution faster. It is often necessary to recreate the failure to understand what went wrong, and then to try various approaches to find the most appropriate and cost-effective resolution. To proceed effectively without causing further disruption to users, a "test" environment can be used to recreate the problem.
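
As an illustration of how CMDB relationship data can support impact assessment, the following is a minimal sketch only. It assumes the CMDB can export, for each configuration item (CI), the list of CIs that depend directly on it. The CI names, the DEPENDENTS mapping, and the impacted_items function are hypothetical and do not correspond to any particular CMDB product.

```python
from collections import deque

# Hypothetical CMDB extract: each CI maps to the CIs that depend directly on it.
# All names here are illustrative assumptions.
DEPENDENTS = {
    "db-server-01": ["orders-db"],
    "orders-db": ["orders-api", "reporting-job"],
    "orders-api": ["web-shop"],
    "reporting-job": [],
    "web-shop": [],
}


def impacted_items(failed_ci, dependents):
    """Return every CI that depends, directly or indirectly, on failed_ci.

    The result is in breadth-first order, so CIs closest to the failure come first.
    """
    seen = {failed_ci}
    queue = deque([failed_ci])
    impacted = []
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:  # guard against cycles and duplicates
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


if __name__ == "__main__":
    print(impacted_items("db-server-01", DEPENDENTS))
```

Walking the same relationships in the opposite direction (from an affected service down to the components it relies on) would give a list of candidate points of failure to examine.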

Root Cause Analysis (RCA) is based on several key analytical concepts and principles, including establishing success conditions, cause/effect relationships, data quality, and risk analysis. Problem-solving techniques in use include chronological analysis, pain value analysis, Kepner-Tregoe, brainstorming, Ishikawa diagrams, and Pareto analysis.
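
Of the techniques listed, Pareto analysis lends itself most readily to a short worked example. The sketch below assumes each incident record has already been tagged with a single cause category. The pareto function, the 80% threshold default, and the sample categories are illustrative assumptions, not part of any defined process.

```python
from collections import Counter


def pareto(causes, threshold=0.8):
    """Return the most frequent causes whose cumulative share first reaches threshold.

    causes is an iterable of cause labels, one per incident.
    The result is a list of (cause, count) pairs, most frequent first.
    """
    counts = Counter(causes)
    total = sum(counts.values())
    vital_few = []
    cumulative = 0
    for cause, count in counts.most_common():
        vital_few.append((cause, count))
        cumulative += count
        if cumulative / total >= threshold:
            break
    return vital_few


if __name__ == "__main__":
    # Hypothetical incident data: ten incidents across four cause categories.
    incidents = (
        ["disk full"] * 5
        + ["config drift"] * 3
        + ["expired certificate"]
        + ["network"]
    )
    for cause, count in pareto(incidents):
        print(cause, count)
```

With this sample data, two of the four categories account for eight of the ten incidents, so those two would be the natural starting point for deeper investigation.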

Root Cause Analysis must be formally documented as part of the Service Engagement document. This document should contain the details of the root cause investigation, contributing factors, observations, and proposed solutions, as well as the preventive and corrective actions. Root Cause Analysis results are shared with the Client for further discussion and approval.